
β–Ί Chapter 12: Introduction to nonlinear learning

12.5 Universal approximation

In Sections 12.1 - 12.3 we saw how supervised and unsupervised learners alike can be extended to perform nonlinear learning via the use of arbitrary linear combinations of nonlinear functions / feature transformations. However the general problem of engineering an appropriate nonlinearity for a given dataset has thus far remained elusive. In this Section we introduce the first of two major tools for handling this task: universal approximators. Universal approximators are families of simple nonlinear feature transformations whose members can be combined to create arbitrarily complex nonlinearities like any we would ever expect to find in a supervised or unsupervised learning dataset. Here we will also introduce the three standard types of universal approximators employed in practice today - kernels, neural networks, and trees - of which we will have much more to say in future Chapters.


12.5.1 Universal approximation

Virtually every complicated object in the world around us can be decomposed into simpler elements. An automobile consists of components like windows, tires, an engine, and so on. A car engine can itself be broken down into a combination of yet simpler components - e.g., a set of pistons, an alternator, etc. Even the base alloys used to construct these engine components can be broken down into combinations of metals, and these metals into combinations of elements from the periodic table.

Complicated mathematical objects - i.e., mathematical functions - can be similarly broken down into combinations of simpler elements, i.e., into combinations of simpler mathematical functions. Stated more formally, any nonlinear mathematical function $h\left(\mathbf{x}\right)$ can be broken down into / approximated by a linear combination of $B$ 'simple' functions $f_1,\,f_2,\,...\,f_B$ in general as

\begin{equation} h\left(\mathbf{x}\right) = w_0 + f_1\left(\mathbf{x}\right){w}_{1} + f_2\left(\mathbf{x}\right){w}_{2} + \cdots + f_B\left(\mathbf{x}\right)w_B \end{equation}

where each of the simpler nonlinear functions on the right hand side can have internal parameters. This is actually a rather old mathematical fact (e.g., with roots in the 1700s) which we call universal approximation in the machine learning / deep learning world. These simple functions are called universal approximators.
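As a concrete sketch of this idea (not part of the text's demos), we can tune the weights $w_0,\,w_1,\,...,\,w_B$ of such a linear combination by least squares. Here we hypothetically use polynomial units to approximate a target $h(x) = \sin(3x)$; the helper name `poly_approx_error` is our own:

```python
import numpy as np

# target nonlinear function we wish to approximate
h = lambda x: np.sin(3 * x)

# a fine grid of input points
x = np.linspace(-1, 1, 200)

def poly_approx_error(B):
    """Least-squares fit of a combination of B polynomial units to h."""
    # feature matrix with columns 1, x, x^2, ..., x^B
    F = np.vander(x, B + 1, increasing=True)
    w, *_ = np.linalg.lstsq(F, h(x), rcond=None)
    # mean squared gap between the tuned combination and h on the grid
    return np.mean((F @ w - h(x)) ** 2)

# the error can only shrink as more units are added to the combination
errors = [poly_approx_error(B) for B in (1, 3, 5, 9)]
```

Because each feature set is nested inside the next, the fitting error is non-increasing in $B$, mirroring the slider animations below.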

For example, below we show a continuous nonlinear function $h$ in black, and approximate this function using $B = 200$ universal approximators. As you move the slider from left to right you will see a red curve jump onto the screen - this is the combination of our $200$ universal approximators, shown first with a poor choice of parameter settings. Moving the slider further to the right illustrates how the shape of this combination improves (in terms of how well it approximates $h$) as we improve its parameters. Eventually - with the slider all the way to the right, where we show an especially good setting of these parameters - the combination approximates $h$ quite well (and indeed if we kept refining its parameters it could represent $h$ perfectly).




Even functions with discrete jumps - like a step function - can be approximated arbitrarily closely using a combination of universal approximators. Below we show an animation similar to the one above, except here we approximate a step function with a discrete jump. We use $B=100$ universal approximators, and once again as you move the slider from left to right the parameters of the combination are improved so that the approximation looks more and more like the function $h$ itself.




What are these 'simple' universal approximators $f_1,\,f_2,\,...\,f_B$? How many do we need (how large does $B$ have to be) for a given nonlinear $h$? How do we set the parameters of this linear combination correctly (including any internal parameters of the functions $f_1,\,f_2,\,...\,f_B$)? What does this have to do with determining a proper nonlinearity for a supervised/unsupervised dataset? We address each of these questions below.

12.5.2 Universal approximators

There is an enormous variety of simple functions $f_1,\,f_2,\,...\,f_B$ that fall into the category of universal approximators. Typically these are organized into three families or catalogs of similar functions called kernels, neural networks, and trees respectively. Roughly speaking, using an exemplar from any one of these families we can perform the kind of universal approximation detailed above.

The kernel family of universal approximators

The family of kernel functions consists of groups of functions with no internal parameters, a popular example being polynomials. When dealing with just one input this sub-family of kernel functions looks like

\begin{equation} f_1(x) = x, ~~ f_2(x) = x^2,~~ f_3(x)=x^3,... \end{equation}

and so forth with the $D^{th}$ element taking the form $f_D(x) = x^D$. A combination of the first $D$ units from this sub-family is often referred to as a degree $D$ polynomial. There are an infinite number of these functions - one for each positive whole number - and they are naturally ordered by their degree (the higher the degree, the further down the list that function is). Moreover each element has a fixed shape - e.g., the monomial $f_2(x) = x^2$ is always a parabola, and looks like a cup facing upwards. In the next Python cell we plot the first four elements (also called units) of the polynomial sub-family of kernel functions.


In two inputs $x_1$ and $x_2$ polynomial units take the analogous form

\begin{equation} f_1(x_1,x_2) = x_1, ~~ f_2(x_1,x_2) = x_2^2, ~~ f_3(x_1,x_2) = x_1x_2^3, ~~ f_4(x_1,x_2) = x_1^4x_2^6,... \end{equation}

with a general degree $D$ unit taking the form

\begin{equation} f_m(x_1,x_2) = x_1^px_2^q \end{equation}

where $p$ and $q$ are nonnegative integers and $p + q \leq D$. A degree $D$ polynomial in this case is a linear combination of all such units. This sort of definition generalizes to polynomial units in general $N$ dimensional input as well. In the Python cell below we draw a sampling of these polynomial units.
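As a quick illustrative sketch (the helper name `polynomial_units_2d` is our own, not from the text), we can enumerate the exponent pairs $(p, q)$ defining the degree $D$ units in two inputs:

```python
def polynomial_units_2d(D):
    """All degree-D polynomial units x1^p * x2^q with p, q nonnegative
    and 1 <= p + q <= D (the constant term is carried separately by w_0)."""
    return [(p, q) for total in range(1, D + 1)
                   for p in range(total + 1)
                   for q in [total - p]]

units = polynomial_units_2d(3)
# a degree D polynomial in two inputs has (D + 1)(D + 2)/2 - 1 such units
```

For $D = 3$ this produces $9$ units, matching the count $(D+1)(D+2)/2 - 1$.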


Another classic sub-family of kernel universal approximators consists of sine waves of increasing frequency - e.g., with frequency increasing by an integer factor like

$$f_1(x) = \text{sin}(x), ~~ f_2(x) = \text{sin}(2x), ~~ f_3(x) = \text{sin}(3x), ...$$

with the $m^{th}$ element given as $f_m(x) = \text{sin}(mx)$.

In the next Python cell we plot the first four of these catalog functions.


As with the polynomials, notice how each of these catalog elements is fixed. They have no tunable parameters inside - the third element always takes the form $f_3(x) = \text{sin}(3x)$, and so always takes on that shape. Also note that, like the polynomials, to generalize this catalog of functions to higher dimensional input we pass each coordinate through the single dimensional version of the function separately. So in the case of $N=2$ inputs the functions take the form

\begin{equation} f_1(x_1,x_2) = \text{sin}(x_1), ~~ f_2(x_1,x_2) = \text{sin}(2x_1)\text{sin}(5x_2), ~~ f_3(x_1,x_2) = \text{sin}(4x_1)\text{sin}(2x_2), ~~ f_4(x_1,x_2) = \text{sin}(7x_1)\text{sin}(x_2), ~~ ... \end{equation}

These are listed in no particular order; in general we can write a catalog element as $f_m(x_1,x_2) = \text{sin}(px_1)\text{sin}(qx_2)$ where $p$ and $q$ are any positive integers.
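One convenient property of this catalog - illustrated in the brief numpy sketch below, which is ours and not from the text - is that distinct sine units are orthogonal over a full period, so they capture genuinely different shapes:

```python
import numpy as np

# evaluate the first four sine units f_m(x) = sin(m x) on a grid over [0, 2*pi)
x = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
F = np.column_stack([np.sin(m * x) for m in range(1, 5)])

# approximate the integral of sin(m x) * sin(n x) over one period:
# it vanishes for m != n and equals pi when m = n
gram = F.T @ F * (2 * np.pi / len(x))
```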

The neural network family of universal approximators

Another family of universal approximators are neural networks. Broadly speaking neural networks consist of parameterized functions whose members - since they are parameterized - can take on a variety of different shapes (unlike kernel functions which each take on a single fixed form). A sub-family of neural networks typically consists of a set of parameterized functions of a single type. For example, a common single layer neural network basis consists of parameterized $\text{tanh}$ functions

\begin{equation} f_1(x) = \text{tanh}\left(w_{1,0} + w_{1,1}x\right), ~~ f_2(x) = \text{tanh}\left(w_{2,0} + w_{2,1}x\right), ~~ f_3(x) = \text{tanh}\left(w_{3,0} + w_{3,1}x\right), ~~ f_4(x) = \text{tanh}\left(w_{4,0} + w_{4,1}x\right), ... \end{equation}

Notice that because there are parameters inside the $\text{tanh}$ the $b^{th}$ such function $f_b$ can take on a variety of shapes depending on how we set its internal parameters $w_{b,0}$ and $w_{b,1}$. We illustrate this in the next Python cell by randomly setting these two values and plotting the table of the associated function.


To handle higher dimensional input we simply take a linear combination of the input, passing the result through the nonlinear function. For example, an element $f_j$ for general $N$ dimensional input looks like

\begin{equation} f_j\left(\mathbf{x}\right) = \text{tanh}\left(w_{j,0} + w_{j,1}x_1 + \cdots + w_{j,\,N}x_N\right) \end{equation}
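A minimal sketch of such an element (the helper name `tanh_unit` is our own) stores the bias $w_{j,0}$ and input-touching weights $w_{j,1},\,...,\,w_{j,N}$ in a single vector:

```python
import numpy as np

def tanh_unit(x, w):
    """Single-layer tanh unit for N-dimensional input x:
    w[0] is the bias w_{j,0}, w[1:] the weights w_{j,1},...,w_{j,N}."""
    return np.tanh(w[0] + np.dot(w[1:], x))

x = np.array([0.5, -1.0])
out = tanh_unit(x, np.array([0.1, 2.0, -0.5]))  # output always lies in (-1, 1)
```

Different choices of `w` give this one unit a variety of shapes, which is exactly what distinguishes neural network units from fixed kernel units.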

Choosing another elementary function gives another sub-catalog of single-layer neural network functions. The rectified linear unit (or 'relu' for short) is another popular example, elements of which (for single dimensional input) look like

\begin{equation} f_1(x) = \text{max}\left(0,w_{1,0} + w_{1,1}x\right), ~~ f_2(x) = \text{max}\left(0,w_{2,0} + w_{2,1}x\right), ~~ f_3(x) = \text{max}\left(0,w_{3,0} + w_{3,1}x\right), ~~ f_4(x) = \text{max}\left(0,w_{4,0} + w_{4,1}x\right), ... \end{equation}

Since these also have internal parameters each can once again take on a variety of shapes. Below we plot $4$ instances of such a function, where in each case its internal parameters have been set at random.


Other 'deeper' elements of the neural network family are constructed by recursively combining single layer elements. For example, to create a two-layer neural network function we take a linear combination of single layer elements - like the $\text{tanh}$ ones shown above - and pass the result through the same $\text{tanh}$ function; e.g., summing $10$ elements and passing the result through $\text{tanh}$ gives a 'two-layer' function. This creates an even more flexible function, since each internal single-layer function $f_j$ also has internal parameters. We illustrate this by example below, showing $4$ random instances of such a neural network function, where in each instance we have randomly set all of its parameters. As can be seen, these instances are far more flexible than the single-layer ones.
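The recursion just described can be sketched directly (the function name `two_layer_tanh` and the weight layout are our own conventions):

```python
import numpy as np

def two_layer_tanh(x, W1, w2):
    """A two-layer unit: a linear combination of single-layer tanh units,
    passed through tanh once more. W1 has one row (bias first) per internal
    unit; w2 holds the outer bias w2[0] and combination weights w2[1:]."""
    inner = np.tanh(W1[:, 0] + W1[:, 1:] @ x)      # the single-layer units
    return np.tanh(w2[0] + np.dot(w2[1:], inner))  # combine, then tanh again

rng = np.random.default_rng(0)
x = np.array([0.3])
W1 = rng.standard_normal((10, 2))   # 10 internal units, 1-dimensional input
w2 = rng.standard_normal(11)        # outer bias plus 10 combination weights
value = two_layer_tanh(x, W1, w2)   # again bounded inside (-1, 1)
```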

We go into substantially further detail in discussing this family of universal approximators in Chapter 13.


The tree family of universal approximators

Like neural networks, a single element from the family of tree-based universal approximators can take on a wide array of shapes. The simplest sort of tree basis consists of discrete step functions - or, as they are more commonly referred to, stumps - whose break lies along a single dimension of the feature space. A stump with 1-dimensional input $x$ can be written as

\begin{equation} f_1(x) = \begin{cases} a_1 & x < V_1 \\ b_1 & x \geq V_1 \end{cases} ~~~~~~~~ f_2(x) = \begin{cases} a_2 & x < V_2 \\ b_2 & x \geq V_2 \end{cases} ~~ \cdots \end{equation}

where $V_{1}$ is the split point at which the stump changes value, and $a_{1}$ and $b_{1}$ are the values taken on by the two sides of the stump, respectively, which we refer to as the levels of the stump.
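Evaluating a stump is a one-liner; this small sketch (the function name `stump` is our own) makes the split-point / level structure explicit:

```python
import numpy as np

def stump(x, V, a, b):
    """Evaluate a stump: level a where x < V, level b where x >= V."""
    return np.where(x < V, a, b)

vals = stump(np.array([-2.0, 0.5, 3.0]), V=1.0, a=0.0, b=1.0)
# -> array([0., 0., 1.])
```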

Figure 1: An illustration of a *stump* function from the family of tree-based universal approximators. Here $V_1$ is called the *split point* and $a_1$ / $b_1$ the *levels* of the function.

In the Python cell that follows we plot four instances of such a stump basis element, where we can see how these elements take on a variety of shapes.


12.5.3 The perfect kind of data for learning

In a perfect world our desired approximations $\approx$ can attain a strict equality $=$ i.e., $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$. In such a perfect instance the dataset lies precisely on a line (or more generally, a hyperplane). The most perfect dataset we could possibly have for linear regression is then - ignoring computation - a line (or hyperplane) itself (as illustrated below in the animated gif).

In complete analogy to the linear case, here our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p, \mathbf{w}\right) = y_p$, i.e., points lying perfectly on the nonlinear curve (or in general the nonlinear surface) defined by $f$. The most perfect dataset we could possibly have for nonlinear regression is then - ignoring computation - a curve (or surface) itself (as illustrated below in the animated gif).

Figure 2: In the first gif we start by showing a realistic linear regression dataset (a small noisy set of points that can be roughly modeled by a line). We then progressively remove all noise from those points (making them all lie perfectly on some line), and finally show the perfect linear regression scenario where we have infinitely many points lying perfectly on a line. In the second gif we do the same for nonlinear regression: starting with a realistic nonlinear regression dataset (a small noisy set of points that can be roughly modeled by a nonlinear curve), progressively removing all noise from those points (making them all lie perfectly on some curve), and finally showing the perfect nonlinear regression scenario where we have infinitely many points lying perfectly on the curve itself.

Examining our setup, here for linear two-class classification our perfect dataset would consist of points where $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = y_p$, i.e., points lying perfectly on the step function with linear boundary given by $\text{model}\left(\mathbf{x},\mathbf{w}\right) = 0$. The most perfect dataset we could possibly have for linear classification is then - ignoring computation - a perfect step function (with linear boundary) itself. This is illustrated below in the animated gif.

Examining our setup, here for nonlinear two-class classification our perfect dataset would consist of points where the nonlinear decision boundary given by $\text{model}\left(\mathbf{x}_p,\mathbf{w}\right) = 0$ perfectly separates the data. In other words, our points lie perfectly on the step function with this nonlinear boundary. The most perfect dataset we could possibly have for nonlinear classification is then - ignoring computation - a perfect step function (with nonlinear boundary) itself. This is illustrated below in the animated gif.

Figure 3: In the first gif we start by showing a realistic linear classification dataset (a small noisy set of points that can be roughly modeled by a step function). We then progressively remove all noise from those points (i.e., returning the true label values to our 'noisy' points), and finally show the perfect linear classification scenario where we have infinitely many points lying perfectly on a step function with linear boundary. In the second gif we do the same for nonlinear classification, ending with infinitely many points lying perfectly on a step function with nonlinear boundary.

3D CLASSIFICATION FIT GOES HERE

OLD EXAMPLES

Example 1: Various predictions using polynomial units

In this example we animate various predictions of $B$ polynomial units, which generally take the form

\begin{equation} \text{predict}\left(\mathbf{x}, \omega \right) = w_0 + {w}_{1}\,f_1\left(\mathbf{x}\right) + {w}_{2}\,f_2\left(\mathbf{x}\right) + \cdots + w_B\,f_B\left(\mathbf{x}\right). \end{equation}

to a variety of simple datasets. Notice in each case how - because these features come from the same sub-family of related nonlinear functions - the corresponding predictions change fairly gradually as additional units are added to the model (in other words, we can fine-tune the amount of nonlinearity we want in our prediction).

First, in the next cell we animate the final trained predictions of fitting $B = 1$ through $B = 4$ polynomial units to the noisy sinusoid dataset shown in the previous Subsection. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data in the left panel. In the right panel we simultaneously show the Least Squares cost function value of each trained model.

In [4]:
# fit B = 1 through B = 4 polynomial units to the noisy sinusoid dataset
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,5)])
Out[4]:



Next, below in the following Python cell we animate the final trained predictions of fitting first $B = 1$ through $B = 10$ polynomial units to a three-dimensional noisy sinusoid dataset. As the slider is moved from left to right one additional polynomial unit is added to the model, the linear combination of units is fit to the dataset, and the resulting fit is shown on the data. We do not plot the corresponding Least Squares cost function value, but as with the above example it decreases as we add more polynomial units to the model.

In [6]:
# fit B = 1 through B = 10 polynomial units to a three-dimensional noisy sinusoid
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_units = [v for v in range(1,10)],view = [20,110],basis = 'poly')
Out[6]:



And finally, an analogous example with 2-class classification. As you move the slider from left to right additional polynomial units are added to the model. In this case each notch of the slider adds multiple units, since we are increasing the degree $D$ of the total polynomial (which, for two inputs, means adding several individual polynomial units each time $D$ is increased).

In [9]:
# load in dataset  http://math.arizona.edu/~dsl/
csvname = datapath + '2_eggs.csv'
demo = nonlib.classification_basis_comparison_3d.Visualizer(csvname)

# run animator
demo.brows_single_fits(num_units = [v for v in range(0,4)], basis = 'poly',view = [30,-80])
Out[9]:



In [76]:
# fit B = 1 through B = 4 tanh units to the noisy sinusoid dataset
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tanh',num_elements = [v for v in range(1,5)])
Out[76]:



  • For two inputs - $x_1$ and $x_2$ - the catalog of single layer functions with $\text{tanh}$ activation
\begin{equation} f_1(x\,,\theta_1) = \text{tanh}\left(w_{1,0} + w_{1,1}x_1 + w_{1,2}x_2\right), ~~ f_2(x\,,\theta_2) = \text{tanh}\left(w_{2,0} + w_{2,1}x_1 + w_{2,2}x_2\right) ... \end{equation}
  • or likewise
\begin{equation} f_1(x\,,\theta_1) = \text{max}\left(0,w_{1,0} + w_{1,1}x_1 + w_{1,2}x_2\right), ~~ f_2(x\,,\theta_2) = \text{max}\left(0,w_{2,0} + w_{2,1}x_1 + w_{2,2}x_2\right) ... \end{equation}
In [46]:
# fit single-layer network units to a three-dimensional noisy sinusoid
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,12)],view = [20,110],basis = 'net')
Out[46]:



3. Trees

  • Basic example: depth-1 tree, also known as a stump
\begin{equation} f_1(x) = \begin{cases} a_1 & x < V_1 \\ b_1 & x \geq V_1 \end{cases} ~~~~~~~~ f_2(x) = \begin{cases} a_2 & x < V_2 \\ b_2 & x \geq V_2 \end{cases} ~~ \cdots \end{equation}
  • Vocab: for $f_j$ the value $V_j$ is called a split point, and $a_j$ and $b_j$ are called levels
In [82]:
# import Draw_Bases class for visualizing various basis element types
demo = nonlib.DrawBases.Visualizer()

# plot a depth-1 tree (stump) basis element
demo.show_1d_tree(depth = 1)
  • Split points and levels set directly using data, not internal parameters
  • e.g., a split point placed between two inputs, with the left / right level set to the average of the outputs to the left / right
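The greedy scheme in these bullets can be sketched in plain numpy; the helper name `best_stump` and the toy step dataset are our own illustrations, not part of the text's demo library:

```python
import numpy as np

def best_stump(x, y):
    """Pick the split / levels minimizing squared error, as described above:
    candidate splits sit midway between consecutive sorted inputs, and each
    level is the mean output on its side of the split."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(1, len(xs)):
        V = (xs[i - 1] + xs[i]) / 2
        a, b = ys[:i].mean(), ys[i:].mean()
        cost = np.sum((ys[:i] - a) ** 2) + np.sum((ys[i:] - b) ** 2)
        if best is None or cost < best[0]:
            best = (cost, V, a, b)
    return best[1:]  # (split point, left level, right level)

# a noiseless step: the recovered split should land midway between 0.4 and 0.6
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
V, a, b = best_stump(x, y)
```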
In [84]:
# browse stumps fit to a simple dataset
demo = nonlib.stump_visualizer_2d.Visualizer()
csvname = datapath + 'noisy_sin_raised.csv'
demo.load_data(csvname)
demo.browse_stumps()
Out[84]:



In [3]:
# fit B = 1 through B = 9 tree (stump) units to the noisy sinusoid dataset
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='tree',num_elements = [v for v in range(1,10)])
Out[3]:



In [49]:
# fit tree units to a three-dimensional noisy sinusoid
demo = nonlib.regression_basis_comparison_3d.Visualizer()
csvname = datapath + '3d_noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fits(num_elements = [v for v in range(1,20)],view = [20,110],basis = 'tree')
Out[49]:



  • Generalizes to $N$ dimensional input (split points / levels in each individual dimension)
  • General attributes of tree bases:
    • each element is adjustable, with internal parameters pre-set greedily to data
    • leads to convex cost functions for regression / classification, trained via boosting (greedy coordinate descent)
    • great for large sparse datasets

How many elements to choose?

  • The more elements we use the better we fit our data (provided we optimize correctly)
  • This is good in theory, but not in practice
  • In theory - with infinite clean data and infinite computation - the more elements (of any basis) --> better fit / separation
  • In fact: in such an instance all three catalogs are equal, and can perfectly represent / separate
In [5]:
# compare bases of increasing size fit to a clean (noiseless) sinusoid
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'sin_function.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,50,1)])
Out[5]:



  • With real data however more elements $\neq$ better fit
In [4]:
# compare bases of increasing size fit to a noisy sinusoid dataset
demo = nonlib.regression_basis_comparison_2d.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_fits(num_elements = [v for v in range(1,25)])
Out[4]:



  • Nothing has 'gone wrong' here - we just do not know ahead of time how many elements is 'too little' and how many is 'too much' (in terms of nonlinearity)
  • i.e., error on training set always goes down as we add elements (provided we optimize correctly)
In [7]:
# training error always decreases as polynomial units are added
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_fit(basis='poly',num_elements = [v for v in range(1,25)])
Out[7]:



  • When the fit is not nonlinear enough / we use too few basis features --> we call this underfitting
  • When the fit is too nonlinear / we use too many basis features --> we call this overfitting
  • The method for finding the best number of elements to use is called cross-validation

The fundamentals of cross-validation

  • A reasonable diagnosis of the overfitting/underfitting problems is that both fail at representing new data, generated via the same process by which the current data was made, that we can potentially receive in the future.
  • But of course we do not have access to any "new data we will receive in the future"
  • So we simulate this scenario by splitting our data into two subsets: a larger training set of data we already have, and a smaller testing set of data that we "will receive in the future."
  • Then, we can try a range of models, fitting each to the training set of known data and measuring its accuracy on both the training and testing sets
  • We pick the number of units that gives us the lowest error on our 'future data', i.e., on our testing set
  • e.g., with a simple example and polynomials
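The selection scheme in the bullets above can be sketched in plain numpy; the helper `pick_degree` and the noiseless quadratic dataset here are our own illustrative assumptions:

```python
import numpy as np

def pick_degree(x_train, y_train, x_test, y_test, degrees):
    """Fit each candidate degree on the training set and return the degree
    achieving the lowest squared error on the held-out testing set."""
    errors = []
    for D in degrees:
        F = np.vander(x_train, D + 1, increasing=True)
        w, *_ = np.linalg.lstsq(F, y_train, rcond=None)
        F_test = np.vander(x_test, D + 1, increasing=True)
        errors.append(np.mean((F_test @ w - y_test) ** 2))
    return degrees[int(np.argmin(errors))], errors

# noiseless quadratic data: any degree >= 2 drives the testing error to zero
x = np.linspace(-1, 1, 30)
y = 1 - 2 * x + 3 * x ** 2
best_D, errs = pick_degree(x[::2], y[::2], x[1::2], y[1::2], degrees=[1, 2, 3, 4])
```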
In [34]:
# cross-validate the number of polynomial units using 3 folds
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='poly',num_elements = [v for v in range(1,10)],folds = 3)
Out[34]:



  • e.g., with a simple example and tanh units
In [38]:
# cross-validate the number of tanh units using 3 folds
demo = nonlib.regression_basis_single.Visualizer()
csvname = datapath + 'noisy_sin_sample.csv'
demo.load_data(csvname)
demo.brows_single_cross_val(basis='tanh',num_elements = [v for v in range(1,10)],folds = 3)
Out[38]:



How to choose training/testing split?

  • Cut the dataset randomly into $K$ non-overlapping folds, using one fold for testing and the remaining $K-1$ for training

  • How many folds? typically between 3 and 10
  • The smaller / less representative the dataset, the more folds should be used - i.e., the fewer datapoints we can afford to set aside to pretend to 'receive in the future' as the test set
  • The extreme case with very small datasets: 'leave-one-out cross-validation', test set a single point!
  • For small to medium sized datasets it's worth performing cross-validation several times, picking the winner on average
  • This is called 'K-fold cross-validation': perform the same routine using one fold as testing / the remaining $K-1$ as training, cycling through the testing fold
  • Reduces the chance that we pick a non-representative testing set (i.e., that we do not overfit to the test set)
  • Very computationally intensive!
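The fold-cutting and cycling described in these bullets can be sketched as follows; the helper name `kfold_indices` is our own, not a library function:

```python
import numpy as np

def kfold_indices(P, K, seed=0):
    """Randomly cut the indices 0,...,P-1 into K non-overlapping folds."""
    shuffled = np.random.default_rng(seed).permutation(P)
    return np.array_split(shuffled, K)

# each cycle treats one fold as the testing set and the other K-1 as training
folds = kfold_indices(12, 3)
for k in range(len(folds)):
    test_idx = folds[k]
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
```

Since every point serves as a test point exactly once, this reduces the chance of overfitting to one unlucky test split, at the cost of $K$ separate training runs.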